Case Study #1 - Lending Club Dataset

Created by: Srikrishnan Veeraraghavan



_Describe the dataset and any issues with it._


Variables in Dataset

The dataset can be broadly divided into 6 different categories of variables:

  1. Demographic Data - Contains data pertaining to the nature of the borrower's employment, their state of residence and the type of home ownership (rented/owned/mortgaged).

  2. Financial Wellbeing - Contains data on the borrower's income and debt-to-income ratio, which indicate the borrower's financial ability to repay a loan, if taken.

  3. Credit History (Delinquency) - Data points pertaining to the borrower's repayment history, specifically signs of failure to repay. Covers the number of delinquencies, severity of the delinquencies and the time since last delinquency.

  4. Credit History (Tradelines) - Broadly covers the number, nature and credit amount of tradelines taken by the borrower, as well as inquiries (an indicator of the borrower's appetite for credit).

  5. Loan Specifications - Details on the particular loan that was approved (Loan Amount, Tenure, Interest Rate etc.)

  6. Loan Status - Pertains to the particular snapshot of the data: the outstanding balance at the given point in time and the amounts paid by the borrower until then.

Here's a list of all variables classified under each category:

[Image: variables classified under each category]

Issues with the Dataset

  1. The dataset covers a very small period (Jan'18 - Mar'18). The consequence is an inability to perform out-of-time validation on any predictions that are made. Since the data covers such a small window, a model trained on it might not generalize well to other time periods, as the nature of the financial ecosystem varies over time.

  2. The snapshot time of the dataset is unclear. Variables like balance and paid_total appear to be populated at some unknown point after the loan is approved. Variables like accounts_opened_24m or months_since_last_credit_inquiry are more likely captured at the point of loan approval, but could also pertain to the time of the snapshot. (For all further analysis we will assume that this kind of data was captured right before the application was approved.)

  3. The dataset contains a mix of numeric and categorical data. Categorical variables need to be encoded in order for most predictive models to be able to process them.

  4. The numeric data is not standardized. Many machine learning algorithms that are sensitive to feature scale (e.g. gradient-descent-based optimizers, KNN) and some EDA techniques (e.g. PCA) need standardized data to function optimally.

  5. Missing Values - While the dataset is fairly dense for the most part, missing values in variables like emp_length and num_accounts_120d_past_due need to be treated before they are used.

  6. There is a lack of clarity on how certain variables are calculated. For instance, the debt_to_income ratio seems alarmingly high for most of the customers!

_Generate a minimum of 5 unique visualizations using the data and write a brief description of your observations. Additionally, all attempts should be made to make the visualizations visually appealing._




1. Quality of Loans

To better understand the quality of the loan portfolio, we can look at the number of loans across each grade and the risk associated with each loan. To evaluate the risk, we define a metric ever_delq_2y as ever delinquent in the last 2 years i.e. if a customer has been delinquent at least once in the last 2 years.
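The per-grade counts and risk can be computed with a simple groupby. A minimal sketch on toy data, assuming the delinquency-count column is named delinq_2y (the real column name in the dataset may differ):

```python
import pandas as pd

# Toy stand-in for the Lending Club frame; assumed columns:
# `grade` and `delinq_2y` (count of delinquencies in the last 2 years)
loans = pd.DataFrame({
    "grade": ["B", "B", "C", "D", "C", "E", "B"],
    "delinq_2y": [0, 1, 0, 2, 0, 1, 0],
})

# ever_delq_2y = 1 if the borrower was delinquent at least once in 2 years
loans["ever_delq_2y"] = (loans["delinq_2y"] > 0).astype(int)

# Loan count and risk (mean of the flag) per grade
summary = loans.groupby("grade")["ever_delq_2y"].agg(["count", "mean"])
print(summary)
```

The "mean" column is exactly the ever-delinquent rate quoted for each grade.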

We can see that the quality of loans is good with most of them falling under the B and C grades. These, as expected, are associated with a lower risk (14-16%) as compared to grades D and E (~19%).



2. Risk of Digital Customers

It is hypothesized that customers who prefer digital channels for transactions are less risky than customers who opt for cash.

Here we compare the risk of borrowers who go for cash disbursement vs DirectPay.

We can see that while far fewer borrowers have DirectPay as their preferred method of disbursement, there is a significant difference in the percentage of borrowers who have been delinquent as compared to cash.



3. Annual Income by State

The income of a borrower is one of the primary factors in gauging the ability of a borrower to repay their loan. Targeting states with a higher median income would be a preferable sales strategy for lenders to procure credit-worthy customers.

While one could use census data for the same information, Lending Club data reflects a population already filtered by credit quality, so the income trends might differ from a more generalized estimate.

Below, we look at a visualization of state-wise median income.

North Dakota, Alaska and New Jersey are among the states that have the highest median income. New Jersey in particular might be a rewarding prospect for procuring new borrowers as it has a fairly high population.



4. Loan Amount and Interest Rates Box Plots

Box plots are a quick way of representing the distributions of variables and comparing distributions across different categories.

Here we can quickly analyse how the disbursed loan amounts and interest rates vary for different purposes of loans.

We can see that the median loan amount is highest for small business loans, which is expected as businesses require larger upfront capital investment. The median interest rate is highest for vacation loans, which makes vacations an attractive segment as disposable incomes rise over the years. It is also interesting to note how large the loans taken to pay off credit cards are (up to 37k and more). This might be beneficial for borrowers, as the average credit card interest rate in the market is around 14.5%, whereas the median interest rate for these loans is only around 10.9%.



5. Distribution of time since last inquiry

Here we look at the distribution of number of months since the last credit inquiry.

We can see that this variable is right-skewed and most borrowers have made inquiries in the past few months. This is likely due to one of two reasons: the data points were updated after loan approval, or this variable includes soft inquiries, which many customers make to gauge the best interest rates.




_Create a feature set and create a model which predicts interest rate using at least 2 algorithms. Describe any data cleansing that must be performed and analysis when examining the data._

Note: Assumptions described as and when they are made.

Data Cleaning

First we will look at missing values in the data set.



For emp_title we will populate the missing values as 'other'.
For emp_length we will populate missing values as 0, since the emp_title is unknown and we cannot generalize from it.
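A minimal sketch of these two fills with pandas, on a toy stand-in for the dataset:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Lending Club data
df = pd.DataFrame({
    "emp_title": ["teacher", None, "nurse"],
    "emp_length": [5.0, np.nan, 2.0],
})

# Unknown job titles become a catch-all category;
# unknown tenure defaults to 0
df["emp_title"] = df["emp_title"].fillna("other")
df["emp_length"] = df["emp_length"].fillna(0)
```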




We can see that missing cases of debt_to_income_ratio arise due to annual_income being 0, but debt_to_income_joint is populated for all of them. We can assign this value to the missing variable since the loan approval would have been done based on both applicants.
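The joint-ratio fallback described above is a one-line fillna; a sketch on toy data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "annual_income": [60000, 0, 45000],
    "debt_to_income": [18.2, np.nan, 22.5],
    "debt_to_income_joint": [np.nan, 30.1, np.nan],
})

# Where the individual ratio is missing (zero individual income),
# fall back to the joint ratio the approval was based on
df["debt_to_income"] = df["debt_to_income"].fillna(df["debt_to_income_joint"])
```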




Variables months_since_last_delinq and months_since_90d_late have missing values due to no history of delinquency. We will assign 999 to these missing values.




We can see that for the variable num_accounts_120d_past_due, almost no customers have accounts that far past due. So we will assign these null values as 0.




For months_since_last_credit_inquiry, the nulls appear when there have been no inquiries. We will assign these nulls a high value, say 999.
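The sentinel fills above can all be done in one pass with a fill dictionary; a sketch on toy data using the dataset's column names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "months_since_last_delinq": [3.0, np.nan],
    "months_since_90d_late": [np.nan, np.nan],
    "num_accounts_120d_past_due": [np.nan, 0.0],
    "months_since_last_credit_inquiry": [1.0, np.nan],
})

# "Never happened" sentinels: a large value for elapsed-time variables,
# zero for the past-due account count
fill_values = {
    "months_since_last_delinq": 999,
    "months_since_90d_late": 999,
    "months_since_last_credit_inquiry": 999,
    "num_accounts_120d_past_due": 0,
}
df = df.fillna(fill_values)
```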





-> On further investigation of the dataset, some columns intuitively do not make sense for making predictions.
-> The variables term, loan_status, issue_month, grade, sub_grade, initial_listing_status, disbursement_method, balance, paid_total, paid_principal, paid_interest, paid_late_fees only exist after the loan is approved.
-> The variable installment is deterministically derived from the loan amount, term and interest rate, the last of which is the target variable.
-> For starters, we will treat verified and unverified income equally and deep-dive into it if there is a high positive correlation with the target.
We can drop all these variables from the dataset.




In order to evaluate joint applications and individual applications together, we can create a combined variable annual_income_max = max(annual_income, annual_income_joint).
Using annual_income_max alongside annual_income, we can evaluate both the individual and the joint repayment capability of the applicants.
Similarly, we can create debt_to_income_min = min(debt_to_income, debt_to_income_joint), which captures the stronger of the two debt positions.
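Both combined variables are row-wise aggregations; pandas skips the NaN joint values for individual applications automatically. A sketch on toy data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "annual_income": [60000, 0],
    "annual_income_joint": [np.nan, 110000],  # NaN = individual application
    "debt_to_income": [18.2, 35.0],
    "debt_to_income_joint": [np.nan, 21.4],
})

# Row-wise max/min across individual and joint figures; NaNs are ignored
df["annual_income_max"] = df[["annual_income", "annual_income_joint"]].max(axis=1)
df["debt_to_income_min"] = df[["debt_to_income", "debt_to_income_joint"]].min(axis=1)
```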




The variable total_credit_utilized does not make sense by itself; it needs to be expressed as the ratio total_credit_utilized/total_credit_limit to capture the extent to which a borrower is utilizing their credit limit.




The variable earliest_credit_line contains year values, which are better viewed as vintage. Since all applications are from 2018, we can subtract the variable's value from 2018.
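The utilization ratio and vintage transformations are both simple column arithmetic; a sketch on toy data (credit_utilization_ratio and credit_vintage are illustrative names for the new columns):

```python
import pandas as pd

df = pd.DataFrame({
    "total_credit_utilized": [5000, 20000],
    "total_credit_limit": [10000, 50000],
    "earliest_credit_line": [2001, 2015],
})

# Utilization expressed as a fraction of the available limit
df["credit_utilization_ratio"] = (
    df["total_credit_utilized"] / df["total_credit_limit"]
)

# Vintage: years of credit history as of the 2018 applications
df["credit_vintage"] = 2018 - df["earliest_credit_line"]
```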




Now, we will need to encode the categorical variables for them to be mathematically usable.
Since variables like emp_title and state have too many distinct values, naive methods like one-hot encoding would produce an unwieldy number of columns.
We can go for target encoding, where we take the mean value of the target variable for each group in the variable. Since target encoding is prone to overfitting, we can use additive smoothing to make the encoding more robust.
Before doing this, we must first split the data into training and test sets to prevent data leakage.
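A sketch of the smoothed encoding on toy data. The smoothing weight m=10 is an illustrative choice, and state/interest_rate stand in for any categorical column and the target; encodings are learned on the training split only, as discussed above:

```python
import pandas as pd

train = pd.DataFrame({
    "state": ["CA", "CA", "NY", "NY", "TX"],
    "interest_rate": [10.0, 12.0, 14.0, 16.0, 11.0],
})
test = pd.DataFrame({"state": ["CA", "WA"]})

def smoothed_target_encode(train, test, col, target, m=10):
    """Additive (m-estimate) smoothing: small groups shrink toward the
    global mean, making the encoding more robust to overfitting."""
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["mean", "count"])
    enc = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
    # Categories unseen in training fall back to the global mean
    return train[col].map(enc), test[col].map(enc).fillna(global_mean)

train["state_enc"], test["state_enc"] = smoothed_target_encode(
    train, test, "state", "interest_rate")
```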


After this preprocessing, we can move to feature selection.




Feature Selection


Now we have 40 variables (including the target) which is quite large!
We need to reduce the dimensionality to prevent overfitting.
While we could use dimensionality reduction techniques like PCA, the model would lose its interpretability. An alternative approach is to use a random forest to compute feature importances and pick the top variables.
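A sketch of importance-based selection, with synthetic data standing in for the cleaned feature matrix (k=4 here for compactness; the text keeps the top 20):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the cleaned feature matrix and interest-rate target
X, y = make_regression(n_samples=300, n_features=8, n_informative=3,
                       random_state=42)

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X, y)

# Rank features by impurity-based importance and keep the top k
k = 4
top_features = np.argsort(rf.feature_importances_)[::-1][:k]
```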




Now we have the 20 best predictors for interest rate. We need to ensure that these variables are not correlated with each other. Pandas' corr() function calculates the Pearson correlation, which we can plot as a correlation matrix across all the variables.




Setting a cutoff of 0.5 as the maximum allowed correlation, we can manually reduce the features down to the following set of 12 variables.




We can plot the box plots of these variables to understand their distribution.




We can see that many of the variables are skewed and do not follow a normal distribution. Therefore, before running our models we can normalize the data instead of standardizing it. (Ideally this should be decided on a case-by-case basis.)




Before we proceed with building the model, we need to check for multicollinearity. This can be checked using the Variance Inflation Factor (VIF). A VIF > 5 is considered high, and variables with a high VIF need to be removed.
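VIF can be computed from scratch with NumPy (statsmodels also provides a ready-made variance_inflation_factor). A self-contained sketch on synthetic data where one column is nearly a copy of another:

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing
    column j on the remaining columns (with an intercept)."""
    n, p = X.shape
    vifs = []
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid.var() / y.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
# Column 2 is almost a copy of column 0, so both get inflated VIFs
X = np.column_stack([a, b, a + 0.1 * rng.normal(size=200)])
vifs = vif(X)
```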




We can see that the VIF of account_never_delinq_percent and emp_title is quite high! We remove these variables iteratively until all VIFs fall under the threshold of 5.



Now that multicollinearity is removed, we can proceed to building the models.




Model Development

1. Linear Regression

The first model we can try is linear regression. The statsmodels package is quite convenient, as it supplies a comprehensive summary of the model metrics for evaluation.




The p-values of the 1st and 7th variables are high! (A threshold of 0.05 is generally accepted.) Re-running after removing these variables:



Results of Linear Regression


The model has an R-squared value of 0.706, which is good but could benefit from fine-tuning. The p-values are very small, indicating that all the remaining variables are significant and contribute to the model.


Now running the model on the test data:



We can see that the Mean Squared Error for the training and test data are similar, indicating that the model generalizes well to new data and there is not much overfitting. (An out-of-time validation might be better here, since the train and test data likely have very similar distributions.)



2. Artificial Neural Network

Now, let us try another model using Neural Networks. Neural Networks will help capture non-linearity in the data if any.
To keep the model simple and avoid overfitting, we will use a single hidden layer for the neural network architecture. Increasing the model's complexity beyond this did not yield significantly better results.


After experimenting with various hyperparameters and optimizers, the following neural network achieves the best results.
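The exact architecture and optimizer are not reproduced here; a minimal single-hidden-layer sketch using scikit-learn's MLPRegressor on synthetic data (the layer width of 32 and the iteration cap are illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the selected feature set and interest-rate target
X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Normalize features, fitting the scaler on the training split only
scaler = MinMaxScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

# Single hidden layer to keep capacity (and overfitting risk) low
nn = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
nn.fit(X_tr, y_tr)

train_mse = mean_squared_error(y_tr, nn.predict(X_tr))
test_mse = mean_squared_error(y_te, nn.predict(X_te))
```

Comparing train_mse and test_mse is the same generalization check used for the linear model.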




Results of Artificial Neural Network


In this case too, the test MSE is similar to the training MSE, so the model generalizes well.



The performance of the Neural Network is quite similar to that of Linear Regression indicating that there is not much non-linearity in the data.



_Propose enhancements to the model, what would you do if you had more time?_

1. Feature Engineering

2. Business Intuition

3. Hyperparameter Tuning

4. Rigorous performance evaluation

5. Model Development Techniques